{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 贝叶斯决策论和朴素贝叶斯分类器\n",
    "\n",
    "徐迪深 浙江工商大学 金融学院 CFA1901"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 贝叶斯决策论\n",
    "\n",
    "​周末和朋友在一个大型商业综合体逛街吃饭，选择太多了，去吃什么好呢？好朋友聚在一起不容易，因此吃饭的选择会比较重要。如何选择呢？大家会提出各自建议：选一个口味符合的，选一个装修漂亮的，选一个生意兴旺的，选一个在美团上有很多介绍的，选一个有朋友去过的，诸如此类的建议你肯定需要尽可能的采纳，而且希望得到一个完美的交集。但是显然把所有的建议都遍历一遍的话，估计大家都不等不到吃饭了，而且你能找到收集全所有的建议吗？显然不能。但是选择还是要做的，饭总归要吃的。因此我们的决策是**基于一个不完备的信息集的**。此外，我们的决策过程是将有效的建议，一个一个根据重要性来尝试的，比如，我们认为一个好的餐馆，第一个也是最重要的特征是在网络平台上有好口碑，第二个次重要的特征是当天的生意好、客人多，第三重要的是我们当中有人去过的...... **我们的决策是一个不断的，但同时是有序的数据采集和信息分析的过程**。\n",
    "\n",
    "我们知道生活中绝大多数决策面临的信息都是不全的，我们只有有限的信息。在无法得到全面的信息的情况下，我们就在信息有限的情况下，尽可能做出一个好的预测。这个每天都在进行的过程就是大名鼎鼎的贝叶斯理论(Bayes theorem)，或者是贝叶斯推断(Bayes inference)，贝叶斯法则(Bayes rule).\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "现在可以设想一个场景。\n",
    "\n",
    "假如你想要去学校附近的宝龙广场的一个餐馆a吃饭。你去查看了网络上对这家餐馆的介绍，比如口味，装修，去的人多不多，想根据这些信息判断这个餐馆好不好。那么在贝叶斯决策论中，我们将诸如**口味，装修，去的人多不多**这些信息称为特征，而餐馆**好**和**不好**称为类别。\n",
    "\n",
    "一个餐馆的口味，装修，去的人多不多都是餐馆的公开信息。那么基于这些信息，餐馆a是好餐馆（不好的餐馆也一样）的概率可以用条件概率表示：\n",
    "\n",
    "$$\n",
    "P(好餐馆|口味，装修，客人的多少)\\qquad(1)\n",
    "$$\n",
    "\n",
    "$(1)$式可以用贝叶斯公式展开，有\n",
    "\n",
    "$$\n",
    "P(好餐馆|口味，装修，客人的多少)=P(好餐馆)\\frac{P(口味，装修，客流量|好餐馆)}{P(口味，装修，客人的多少)}\\qquad(2)\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "显然我们希望$(1)$能越大越好。\n",
    "\n",
    "而将$(1)$得到$(2)$之后，可以发现$𝑃(好餐馆)$（我们也称为先验（prior）概率）是一个固定的值，同理，$P(口味，装修，客人的多少)$也是常数。在实际计算中，我们可以根据大数定理，用频率来估计概率。\n",
    "\n",
    "所以当$𝑃(口味，装修，客人的多少|好餐馆)$最大时，$(1)$也会最大。$𝑃(口味，装修，客流量|好餐馆)$这个我们叫做似然（likelihood）概率。而最大化似然概率我们最常用的就是极大似然估计。\n",
    "\n",
    "$(2)$也给了我们一个启示：\n",
    "\n",
    "$$\n",
    "\\frac{P(口味，装修，客人的多少|好餐馆)}{P(口味，装修，客流量)}\\qquad(3)\n",
    "$$\n",
    "\n",
    "假如$(3)$能够大于1，就意味着比起我们的总体样本，那就意味着这一类餐馆更有可能是好餐馆。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 极大似然估计\n",
    "\n",
    "回到上面一节我们的思路：想要$P(好餐馆|口味，装修，客人的多少)$更大，那么$P(口味，装修，客人的多少|好餐馆)$就要更大\n",
    "\n",
    "现在的问题是如何最大化似然概率。我们常用的方法就是极大似然估计。\n",
    "\n",
    "以极大似然估计$P(口味，装修，客人的多少|好餐馆)$为例。\n",
    "\n",
    "现在你在网上看到了N个人对于餐馆的评论，我们对每个人标记为$x_i$($i \\in N$),其中$x_i$={$口味=偏咸/偏甜，装修=豪华/简单，客流量=多/少$}，以及每个人的评价$y_i$={$好/不好$}（$i \\in N$）\n",
    "\n",
    "那么，就可以由这些人的评论得到如下一组概率\n",
    "\n",
    "$$\n",
    "P([口味=偏甜/偏咸，装修=豪华/简单，客流量=多/少]的各种情况的组合|y=好餐馆)\n",
    "$$\n",
    "\n",
    "实际上最后就是想找到一个最符合好餐厅的特征（例如我喜欢口味偏咸，装修简单，客流量少的餐厅，那么对于我来说$P(口味=咸，装修=简单，客流量=少|好参观)$这个概率就是最大的似然概率）"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 朴素贝叶斯分类器\n",
    "\n",
    "贝叶斯分类器有很多种，而朴素贝叶斯分类器是在所有特征都独立的假设进行分类的一种算法。至于为什么要做独立的假设（往往这个假设并不符合实际），是因为独立的假设使得计算简单很多，而对于预测的准确程度又不会损失到不能接受的程度。\n",
    "\n",
    "那么，上面的$(2)$式就可以写成\n",
    "\n",
    "$$\n",
    "P(好餐馆|口味，装修，客人的多少)=P(好餐馆)\\frac{P(口味|好餐馆)P(装修|好餐馆)P(客流量|好餐馆)}{P(口味)P(装修)P(客人的多少)}\\qquad(5)\n",
    "$$\n",
    "\n",
    "这样比起原来的式子，明显计算方面简单许多。但是，贝叶斯分类器也并不是只有朴素贝叶斯分类器一种，还有一类半朴素贝叶斯分类器，相对朴素的假设则放宽了许多，精度提高的同时计算也更加复杂。分类器的选用具体要根据任务的要求和背景决定"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 拉普拉斯修正\n",
    "\n",
    "有时候我们会遇到这样的情况。\n",
    "\n",
    "还是以判断餐馆a的好坏为例：你现在拿到了网上各个网友对餐馆a的口味，装修，客流量和对餐馆a的评价（好或者不好），但是你还多关注了一个特征：是否有音乐。但网上评价的人并不在意这个，所以没有人提起这个餐馆有没有音乐。于是有一个特征的特征值都缺失了。那么你在具体计算先验概率和似然概率甚至在计算贝叶斯公式中的分母（$P(音乐)$）的时候，这个概率为0（但分母肯定不可以为0）。\n",
    "\n",
    "你可以选择丢弃这个特征。但是你仍然很在意这个餐厅是否有音乐，所以并不想丢弃，那么该如何解决这个问题？\n",
    "\n",
    "拉普拉斯修正就是一个解决方法。\n",
    "\n",
    "拉普拉斯修正核心要保证的点是两个：第一个是让计算不违背数学的基本规律，第二个是不能让修正后的概率和修正之前有不可接受的差异（比如修正前我们的$P(音乐)$为0但修正后突然增加到0.8，这样会使得结果产生重大误差。）\n",
    "\n",
    "具体做法如下：\n",
    "\n",
    "还是以我们上面要判断餐馆a的好坏为例。其实在你的训练样本足够大的情况下，我们原来的对音乐取餐厅好坏的条件的似然概率计算是这样的：\n",
    "\n",
    "$$\n",
    "P(音乐=有|餐厅=好)=\\frac{有音乐且为好餐厅的个数=0}{好餐厅的个数}=0\n",
    "$$\n",
    "\n",
    "那么拉普拉斯修正就可以这样修正：\n",
    "\n",
    "$$\n",
    "P'(音乐=有|餐厅=好)=\\frac{有音乐且为好餐厅的个数+1}{好餐厅的个数+音乐这个特征的取值（有/无，所以是2）}\\neq 0\n",
    "$$\n",
    "\n",
    "然而拉普拉斯修正还不能改变信息集本身的特点。举个例子，假如你没有做拉普拉斯修正之前，你调查到的信息表示宝龙广场的特点是口味偏咸，装修简单，客流量大和没有音乐的餐馆更有可能是好餐馆，然而修正之后信息反应的是口味偏甜，装修豪华的，客流量小且有音乐的餐馆更有可能是好餐馆，那么显然这个修正是一个很失败的修正。\n",
    "\n",
    "所以既然上面音乐对餐厅好坏取条件的似然概率做了修正，那么其它特征都要相应做修正。\n",
    "\n",
    "先验概率的修正：\n",
    "$$\n",
    "P'(餐厅=好)=\\frac{好餐厅的个数+1}{所有餐厅的个数+所有特征的个数（口味，装修，客流量，音乐所以总共4个）}\n",
    "$$\n",
    "\n",
    "似然概率的修正，以口味的似然概率为例：\n",
    "\n",
    "$$\n",
    "P'(口味=偏咸|餐厅=好)=\\frac{口味偏咸且是好餐厅的个数+1}{好餐厅的个数+口味这个特征可能取值的个数（偏咸/偏甜，所以是2）}\n",
    "$$\n",
    "\n",
    "这样就能保证不舍弃音乐这个特征，且不改变原来训练样本（经验）的信息情况下，对目标进行分类（预测）"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 贝叶斯分类器在金融的应用：运用财务指标对上市公司的质量进行判断"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "我国在1998年在A股市场实行了特别风险提示（Special Treatment）制度来完善我国股市在对于财务状况不佳或其他经营状况不佳的上市公司的处理制度。对于连续两个财年利润为负或者其它经营状况不佳的情况的上市公司，证监会要求这类上市公司，其股票名称要在前面冠名\"ST\"并单独披露信息，每日股价涨跌幅限制在±5%。\n",
    "\n",
    "E.Altman（1986）对美国的破产清算的上市公司进行研究提出了基于财务指标的用于判断上市公司是否可能在未来陷入财务困境的Z值模型。我们可以通过python程序来编写一个朴素贝叶斯分类器来观察朴素贝叶斯分类器在判断我国上市困境的财务困境是否拥有更好的预测效果。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import math"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>证券代码</th>\n",
       "      <th>证券名称</th>\n",
       "      <th>X1</th>\n",
       "      <th>X2</th>\n",
       "      <th>X3</th>\n",
       "      <th>X4</th>\n",
       "      <th>X5</th>\n",
       "      <th>Z值</th>\n",
       "      <th>描述</th>\n",
       "      <th>类别</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>600077.SH</td>\n",
       "      <td>宋都股份</td>\n",
       "      <td>7</td>\n",
       "      <td>10</td>\n",
       "      <td>10</td>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>5.9789</td>\n",
       "      <td>良好</td>\n",
       "      <td>ST</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>600091.SH</td>\n",
       "      <td>*ST明科</td>\n",
       "      <td>5</td>\n",
       "      <td>10</td>\n",
       "      <td>8</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>-1.0051</td>\n",
       "      <td>堪忧</td>\n",
       "      <td>ST</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>600131.SH</td>\n",
       "      <td>国网信通</td>\n",
       "      <td>5</td>\n",
       "      <td>10</td>\n",
       "      <td>10</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1.4975</td>\n",
       "      <td>堪忧</td>\n",
       "      <td>ST</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>600145.SH</td>\n",
       "      <td>*ST新亿</td>\n",
       "      <td>6</td>\n",
       "      <td>10</td>\n",
       "      <td>9</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>3.3578</td>\n",
       "      <td>良好</td>\n",
       "      <td>ST</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>600179.SH</td>\n",
       "      <td>安通控股</td>\n",
       "      <td>5</td>\n",
       "      <td>10</td>\n",
       "      <td>10</td>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>1.3321</td>\n",
       "      <td>堪忧</td>\n",
       "      <td>ST</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>309</th>\n",
       "      <td>002625.SZ</td>\n",
       "      <td>光启技术</td>\n",
       "      <td>7</td>\n",
       "      <td>10</td>\n",
       "      <td>10</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>9.2708</td>\n",
       "      <td>良好</td>\n",
       "      <td>非ST</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>310</th>\n",
       "      <td>600486.SH</td>\n",
       "      <td>扬农化工</td>\n",
       "      <td>8</td>\n",
       "      <td>10</td>\n",
       "      <td>10</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>5.0731</td>\n",
       "      <td>良好</td>\n",
       "      <td>非ST</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>311</th>\n",
       "      <td>300088.SZ</td>\n",
       "      <td>长信科技</td>\n",
       "      <td>6</td>\n",
       "      <td>10</td>\n",
       "      <td>10</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>4.6264</td>\n",
       "      <td>良好</td>\n",
       "      <td>非ST</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>312</th>\n",
       "      <td>600884.SH</td>\n",
       "      <td>杉杉股份</td>\n",
       "      <td>7</td>\n",
       "      <td>10</td>\n",
       "      <td>10</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>1.8362</td>\n",
       "      <td>不稳定</td>\n",
       "      <td>非ST</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>313</th>\n",
       "      <td>600516.SH</td>\n",
       "      <td>方大炭素</td>\n",
       "      <td>8</td>\n",
       "      <td>10</td>\n",
       "      <td>10</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>4.2410</td>\n",
       "      <td>良好</td>\n",
       "      <td>非ST</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>314 rows × 10 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "          证券代码   证券名称  X1  X2  X3  X4  X5      Z值   描述   类别\n",
       "0    600077.SH   宋都股份   7  10  10   1   3  5.9789   良好   ST\n",
       "1    600091.SH  *ST明科   5  10   8   1   1 -1.0051   堪忧   ST\n",
       "2    600131.SH   国网信通   5  10  10   1   1  1.4975   堪忧   ST\n",
       "3    600145.SH  *ST新亿   6  10   9   1   1  3.3578   良好   ST\n",
       "4    600179.SH   安通控股   5  10  10   1   3  1.3321   堪忧   ST\n",
       "..         ...    ...  ..  ..  ..  ..  ..     ...  ...  ...\n",
       "309  002625.SZ   光启技术   7  10  10   1   2  9.2708   良好  非ST\n",
       "310  600486.SH   扬农化工   8  10  10   1   2  5.0731   良好  非ST\n",
       "311  300088.SZ   长信科技   6  10  10   1   2  4.6264   良好  非ST\n",
       "312  600884.SH   杉杉股份   7  10  10   1   2  1.8362  不稳定  非ST\n",
       "313  600516.SH   方大炭素   8  10  10   1   1  4.2410   良好  非ST\n",
       "\n",
       "[314 rows x 10 columns]"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "def get_data(name,file='.xlsx'):\n",
    "    return pd.read_excel(name+file)\n",
    "#我们可以用pandas导入我们的数据\n",
    "#这里我们选取了2011到2012年的ST公司和与其类似的财务状况良好的上市公司\n",
    "df = get_data('train')\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在这我们特别说明一下X1到X5分别是什么：\n",
    "X1：净营运资本 / 总资产\n",
    "X2：留存收益 / 总资产\n",
    "X3：息税前收益/ 总资产 \n",
    "X4：优先股和普通股市值 /总负债\n",
    "X5：销售额 / 总资产\n",
    "\n",
    "在原来的Z值模型中每一个财务指标都是一个类型为浮点数的数值，不过我们将每一个特征的数值根据大小划分为4个区间，得到了上面的表格。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "314\n"
     ]
    }
   ],
   "source": [
    "print(len(df))#显示我们的样本总量"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "#编写朴素贝叶斯分类器\n",
    "class naive_bayes_classifier(object):\n",
    "    \n",
    "    def __init__(self,feature_name_list,features,labels,label_name,df,laplacian):\n",
    "        #我们依次输入含有所有特征名称的列表，所有类别的列表，类别所在列的名称，和我们导入的数据\n",
    "        self.feature_name_list = feature_name_list\n",
    "        self.features = features\n",
    "        self.labels = labels\n",
    "        self.data = df\n",
    "        self.label_name = label_name\n",
    "        self.likelihood = None\n",
    "        self.prior_probability = None\n",
    "        self.one = None\n",
    "        self.laplacian = laplacian\n",
    "    \n",
    "    #计算先验概率\n",
    "    def compute_prior_probability(self):\n",
    "        prior_probability = {}\n",
    "        classes = self.data[self.label_name]\n",
    "        for name in self.labels:\n",
    "            prior_probability[name] = (len(classes[classes == name])+1*self.laplacian)/(len(self.data)+self.laplacian*len(classes))\n",
    "        print(prior_probability)\n",
    "        self.prior_probability = prior_probability  #先验概率的计算在实际应用中并不复杂\n",
    "                                                      #就是看每个类别在样本中的频率\n",
    "    \n",
    "    #计算似然概率\n",
    "    def compute_likelihood(self): \n",
    "        likelihood = {}\n",
    "        for feature_name in self.feature_name_list:\n",
    "            likelihood[feature_name] = {}\n",
    "            for feature in self.features:\n",
    "                likelihood[feature_name][feature] = {}\n",
    "                for label in self.labels:\n",
    "                    data = df[[feature_name,label_name]]\n",
    "                    #因为我们就需要看一个特征相对于类别的似然概率，自然将特征列和类别列单独拿出会方便很多\n",
    "                    likelihood_nemerator = len(data[(data[feature_name] == feature)&(data[label_name] == label)])+self.laplacian*1\n",
    "                    #这里计算的就是似然概率的分子，计算符合某一个特征且符合某一个类别的样本个数\n",
    "                    likelihood_denominator = len(data[data[label_name] == label])+self.laplacian*len(df[feature_name].unique())\n",
    "                    #这里就是计算分母，计算符合某一个特征的个数\n",
    "                    likelihood[feature_name][feature][label] = likelihood_nemerator/likelihood_denominator\n",
    "        self.likelihood = likelihood\n",
    "        print(likelihood)\n",
    "        #程序的思路是这样：我们想要计算的就是条件概率：如P(X1，ST)/P（ST），我们实际上要做的就是先找出所有公司中的ST的公司\n",
    "        #记录个数（函数len（））\n",
    "        #然后在找出来的公司记录X1很低，低，中，高的公司分别有多少个\n",
    "        #然后相除就可以得到这个条件概率了\n",
    "        \n",
    "    def predict(self,target):\n",
    "        score = self.prior_probability.copy()\n",
    "        information = self.likelihood.copy()\n",
    "        for key1 in score.keys():\n",
    "            for key2 in target.keys():\n",
    "                for value in target.values():\n",
    "                    score[key1] += math.log(information[key2][value][key1])\n",
    "                    #贝叶斯如果目的是为了做预测，那么实际上判断贝叶斯公式的大小就可以了\n",
    "        for key in score.keys():\n",
    "            score[key] = score[key]/sum(score.values())#然而为了能给使用者更直观的感受，我们这里进行归一化让结果以概率的方式展现\n",
    "        print(score)\n",
    "        if score['ST']>score['非ST']:#做比较，朴素贝叶斯分类器会建议概率更大的类别\n",
    "            result = 'ST'\n",
    "        else:\n",
    "            result = '非ST'\n",
    "        return result"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'ST': 0.2515923566878981, '非ST': 0.2515923566878981}\n"
     ]
    },
    {
     "ename": "AttributeError",
     "evalue": "'NoneType' object has no attribute 'values'",
     "output_type": "error",
     "traceback": [
      "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[1;31mAttributeError\u001b[0m                            Traceback (most recent call last)",
      "\u001b[1;32m<ipython-input-9-851ef50e34d6>\u001b[0m in \u001b[0;36m<module>\u001b[1;34m\u001b[0m\n\u001b[0;32m      7\u001b[0m \u001b[1;31m#输入参数\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m      8\u001b[0m \u001b[0mbayes\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mnaive_bayes_classifier\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mfeature_name_list\u001b[0m\u001b[1;33m,\u001b[0m\u001b[0mfeatures\u001b[0m\u001b[1;33m,\u001b[0m\u001b[0mlabels\u001b[0m\u001b[1;33m,\u001b[0m\u001b[0mlabel_name\u001b[0m\u001b[1;33m,\u001b[0m\u001b[0mdf\u001b[0m\u001b[1;33m,\u001b[0m\u001b[1;32mTrue\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m----> 9\u001b[1;33m \u001b[0msum\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mbayes\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mcompute_prior_probability\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mvalues\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m     10\u001b[0m \u001b[0mbayes\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mcompute_likelihood\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n",
      "\u001b[1;31mAttributeError\u001b[0m: 'NoneType' object has no attribute 'values'"
     ]
    }
   ],
   "source": [
    "#现在对我们的朴素贝叶斯分类器进行训练\n",
    "features = list(range(1,11))#我们样本中所有的特征值\n",
    "labels = ['ST','非ST']#我们所有的类别\n",
    "label_name = '类别'#类别那一列我们取的名字就叫“类别”\n",
    "feature_name_list = list(df.columns[2:7])#我们的表格中，取出我们要训练的特征列的列名\n",
    "\n",
    "#输入参数\n",
    "bayes = naive_bayes_classifier(feature_name_list,features,labels,label_name,df,True)\n",
    "bayes.compute_prior_probability().values()\n",
    "bayes.compute_likelihood()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#我们从同花顺上2016年的ST股票\n",
    "#下载了对应的z值模型数据来看朴素贝叶斯分类器和z值模型在预测我国上市公司财务困境上的效果\n",
    "ST_test = pd.read_excel('2016ST_test.xlsx')\n",
    "ST_test.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "len(ST_test)#测试集的数量"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "not_ST_test = pd.read_excel('2016_not_ST.xlsx')\n",
    "not_ST_test.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "len(not_ST_test)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#将第二行的数据方进去预测，这一个样本Z值的预测是错误的\n",
    "target = {'X1':1,'X2':8,'X3':6,'X4':1,'X5':5}#可以根据上面给出的数据换一些测试样本进行测试\n",
    "bayes.predict(target)\n",
    "#结果可以看到原本Z值判断错误的样本在朴素贝叶斯分类器下能得到和实际相符的预测结果"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#建立一个字典存入所有的测试样本\n",
    "test_samples = {}\n",
    "for code in ST_test['证券代码']:\n",
    "    test_samples[code] = {}\n",
    "    for key in ST_test.columns[2:7]:\n",
    "        test_samples[code][key] = ST_test[ST_test['证券代码']==code][key].values[0]\n",
    "#print(test_samples)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#建立一个循环看看我们的训练效果\n",
    "p = 0\n",
    "for code in test_samples.keys():\n",
    "    if bayes.predict(test_samples[code])=='ST':\n",
    "        p += 1 \n",
    "    bayes.predict(test_samples[code])\n",
    "    print('-------------------------------')\n",
    "print(str(p/len(ST_test)*100)+'%')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "test_samples = {}\n",
    "for code in not_ST_test['证券代码']:\n",
    "    test_samples[code] = {}\n",
    "    for key in not_ST_test.columns[2:7]:\n",
    "        test_samples[code][key] = not_ST_test[not_ST_test['证券代码']==code][key].values[0]\n",
    "#print(test_samples)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "p = 0\n",
    "for code in test_samples.keys():\n",
    "    if bayes.predict(test_samples[code])=='非ST':\n",
    "        p += 1 \n",
    "    bayes.predict(test_samples[code])\n",
    "    print('-------------------------------')\n",
    "print(str(p/len(ST_test)*100)+'%')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "贝叶斯的奥秘其实可以用直方图展现出来，利用matplotlib包我们可以做出先验概率和似然概率的图"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#以下代码实际上是根据条件对不同特征的公司计数\n",
    "laplacian = True\n",
    "likelihood = {}\n",
    "for feature_name in feature_name_list:\n",
    "    for feature in features:\n",
    "        for label in labels:\n",
    "            tmp_df = df[[feature_name,'类别']]\n",
    "            likelihood[feature_name+str(feature)+label] = \\\n",
    "            (len(tmp_df[(tmp_df[feature_name]==feature)&(tmp_df['类别']==label)])+laplacian*1)/\\\n",
    "            (len(tmp_df[tmp_df[label_name] == label])+laplacian*len(tmp_df[feature_name].unique()))\n",
    "            \n",
    "print(likelihood)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "\n",
    "#下面的代码用来显示中文字体\n",
    "plt.rcParams['font.sans-serif']=['SimHei']    #指定默认字体 SimHei为黑体\n",
    "plt.rcParams['axes.unicode_minus']=False       \n",
    "\n",
    "fig = plt.figure(figsize=(50,50))\n",
    "illustration_prior_probability = fig.add_subplot(221)\n",
    "illustration_prior_probability.hist(df['类别'],density=True)\n",
    "illustration_prior_probability.set_title('先验概率直方图')\n",
    "\n",
    "illustration_likelihood = fig.add_subplot(223)\n",
    "illustration_likelihood.bar(list(likelihood.keys()),list(likelihood.values()),width=1.5)\n",
    "illustration_likelihood.set_title('似然概率直方图')\n",
    "\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "可以看到，先验概率几乎不承载信息，因为先验概率并没有差异；而相对的似然概率则承载了非常多的信息。\n",
    "\n",
    "实际上我们在获取数据时往往会选择让各个类别的先验概率不要有太大差异。如果先验概率差别很大，1000支股票ST有999个，而非ST只有1个，那么你随机选出的股票只有很小的概率能抽到非ST股票。那么你在训练贝叶斯分类器的时候就意味着它只能学习到单方面的信息，则预测时你的结果几乎都是ST。\n",
    "\n",
    "而我们希望似然概率能在类别上表现出显著的差异则体现在贝叶斯决策本身对信息更新的特点。其实先验概率就好像你刚刚进入市场，你只知道有两种股票：ST股和非ST股，而当你看到了市场公司中的财报，就相当于获得了新的信息，那么假如ST和非ST公司发布的财报数据没有什么区别，那么这些财报对于你的预测就是完全没有意义的，因为你没有办法从财报中判断一家公司将来的状况了。"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
